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P-(. Abstract 



Matching pursuits arc a class of greedy algorithms commonly used in signal processing, for solving the sparse 

approximation problem. They rely on an atom selection step that requires the calculation of numerous 

projections, which can be computationally costly for large dictionaries and burdens their competitiveness in 

^^ ■ coding applications. We propose using a non adaptive random sequence of subdictionaries in the decomposi- 

r^ . tion process, thus parsing a large dictionary in a probabilistic fashion with no additional projection cost nor 

parameter estimation. A theoretical modeling based on order statistics is provided, along with experimental 

Ij , evidence showing that the novel algorithm can be efficiently used on sparse approximation problems. An 

application to audio signal compression with multiscale time-frequency dictionaries is presented, along with 

j«^ , a discussion of the complexity and practical implementations. 
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1. Introduction 



The ability to describe a complex process as a combination of few simpler ones is often critical in en- 
gineering. Signal processing is no exception to this paradigm, also known in this context as the sparse 
^^ ■ representation problem. Yet rather simple to expose, its combinatorial nature has led researchers to develop 

so many bypassing strategies that it can be considered a self-sufficient topic. Throughout these years of 
work, many additional benefits were found arising from the dimensionality reduction. Firstly, sparsity allows 
faster processing, and biologically inspired models [37[ suggest that the mammalian brain takes advantage 
of it. Secondly, it helps reducing both storage and broadcasting costs [3J]. Finally, it enables semantic 
characterization, since few significant objects carry most of the signals information. 

The central problem in sparse approximation theory may be written as such: Given a signal / in a Hilbcrt 
space H and a finite-size dictionary $ = {4'i} of M unit norm vectors (V7 G [1..M] , ||</>').|P = 1) in H, called 
atoms, find the smallest expansion of / in $ up to a reconstruction error e : 

M 

minllajlos.t. H/ - ^ a^^^f < e (1) 

7=1 

where |ja||o is the number of non-zero elements in the sequence of weights {a.^}. Equivalently, one seeks the 
subset of indexes F™ = {jn}n=i..m of atoms of $ and the corresponding m non zero weights {a7„} solving: 

minm s.t. ||/-/,„|p < e (2) 
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where fm = J2^=i 0^77. '/'7„ is called a m-term approximant of /. 

A corollary problem is also defined, the sparse recovery problem, where one assumes that / has a sparse 
support r™ in $ (m being much smaller than the space dimension) and tries to recover it (from noisy or fewer 
measurement than the signals dimension) . In the sparse approximation problem, no such assumption is made 
and one just seeks to retrieve the best ?7i-term approximant of / in the sense of minimizing the reconstruction 
error (or any adapted divergence measure). Many practical problems benefit from this modeling, from 
denoising [IJ, IJ] to feature extraction and signal compression [15|, [34 1 . 



Unfortunately, problem ([T|) is NP-hard and an alternate strategy needs to be adopted. A first class of 
existing methods are based on a relaxation of the sparsity constraint using the /i-norm instead of the non- 
convex /q J42I |j| , thus solving a quadratic program. Alternatively, greedy algorithms can produce (potentially 
suboptimal) solutions to problem ([T]) by iterativcly selecting atoms in $. 

Matching Pursuit (MP) [27| and its variants [30|, 122, la, la, |3| are instances of such greedy algorithms. 



A comparative study of many existing algorithms can be found in ll[. MP-like algorithms have simple 
underlying principles allowing intuitive understanding, and efficient implementations can be found [24. 126 1. 
Such algorithms construct an approximant /„ after n iterations by alternating two steps: 

Step 1 Select an atom (f)^^ G $ and append it to the support F" ~ pn-i lj^„ 

Step 2 Update the approximation /„ accordingly. 

until a stopping criterion is met or a perfect decomposition is found (i.e /„ = /). This two-step generic 
description defines the class of General MP algorithms [20J . 

When designing the dictionary, it is important that atoms locally ressemble the signal that is to be 
represented. This correlation between the signal and the dictionary atoms ensures that greedy algorithms 
select atoms that remove a lot of energy from the residual). For some classes of signals (e.g. audio), the 
variety of encountered waveforms implies that a large number of atoms must be considered. Eventually, one 
is interested in finding sparse expansions of signals in large dictionaries at reasonable complexity. 

Methods addressing sparse recovery problems have benefited from random dictionary design properties 
(e.g with MP-like algorithms [43l, l3|). Randomness in this context is used for spreading information that is 
localized (on a few non-zero coefficients in a large dictionary) among fewer measurement vectors. 

In the sparse approximation context though, the use for randomness is less obvious. Much effort has 
been done on designing structured dictionaries that exhibit good convergence pattern at reasonable costs 
|34l . IgI, l23| . Typical dictionaries with limited size are based on Time-Frequency transforms such as window 
Fourier transforms (also referred to as Gabor dictionaries) [27l,ll9|, Discrete Cosine [3^, wavelets [37| or unions 
of both [6[. Such dictionaries are made of atoms that are localized in the time-frequency plance, which gives 
them enough flexibility to represent complex non stationary signals. However the set of considered localization 
reflects, in practice, a compromise. 

Considering all possible localizations is, in theory, possible for discrete signals and dictionaries, however 
it yields very large and redundant dictionaries which raises computational issues. The standard dictionary 
design [27|] considers a fixed subset of localizations is arguably a good option but it can never be optimal for 
all possible signals. Adapting the subset to the signal is then possible, but it introduces additional complexity. 
Finally, one may consider using multiple random subsets, perform multiple decompositions and average them 
in a post-processing step [13[. 

In this paper, an other option is considered: the use of randomly varying subdictionaries at each iteration 
during a single decomposition. The idea is to avoid the additional complexity of adapting the dictionary 
to the signal, or of having to perform multiple decompositions, while still improving on the fixed dictionary 
strategy. 

A parallel can be traced with quantization theory. Somehow, limiting dictionary size amounts to dis- 
cretizing its atoms parameter space. This quantization introduces noise in the decomposition (i.e error due 
to dictionary-related model mismatch). Adaptive quantization reduces the noise and multiple quantization 



allow to average the noise out. But the proposed technique is closer to the dithering technique l44[ and 
more generally to the stochastic resonance theory, which shows how a moderate amount of added noise can 
increase the behavior of many non linear systems |l6l ] . 
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Figure 1: Block diagram for General MP algorithms. An input signal f is decomposed onto a dictionary $ by iteratively 
selecting atoms 4>-y^ (Step 1) and updating an approximant fn (Step 2). 

The proposed method only rehes on a shght modification of step 1 and can therefore be apphed to many 
pursuit algorithms. In this work, we focus on the approximation problem. Experiments on real audio data 
demonstrate the potential benefits in signal compression and particularly in low-bit-rate audio coding. It 
should be emphasized that, as such, it is not well suited for sparse recovery problems such as in the compressed 
sensing scheme [9|. Indeed, searching in random subsets of $ may burden the algorithm ability to retrieve 
true true sparse support of a signal in $. 

The rest of this paper is organized as follows. Section [5] recalls the standard Matching Pursuit algorithm 
and some of its variants. Then, wc introduce the proposed modification of Step 1 in Section [3] and expose 
some theoretical justifications based on a probabilistic modeling of the decomposition process. A practical 
application is presented in Section |4] as a proof of concept with audio signals. Extended discussion on various 
aspects of the new method is provided in Section [5l 



2. Matching Pursuit for Sparse Signal Representations 

2.1. Matching Pursuit Framework 



The Matching Pursuit (MP) algorithm [27[ and its variants in the General MP family [20| all share the 
same underlying iterative two-step structure, as described by Figure [TJ Specific algorithms differ in the way 
they perform the atom selection criteria C (Step 1), and the approximant construction A (Step 2). Standard 
MP as originally defined in [27| is given by: 



argmax|(i?"7,(/)^) 

07 e* 



C($,i?"/) 



(3) 
(4) 



where R"f denotes the residual at iteration n (i.e R"f = / — /„). Orthogonal Matching Pursuit (OMP) [30| 
is based on the same selection criteria, but the approximant update step ensures orthogonality between the 
residual and the subspace spanned by the already selected atoms Vp" = span{(/)^^}-y^gr" 



A 



OMP 



(/,$r") 



PVr^I 



(5) 



where Py is the orthonormal projector onto V. 

The algorithm stops when a predefined criterion is met, either an approximation threshold in the form of 
a Signal-to- Residual Ratio (SRR): 

SRR{n)^lO\og-j^-^ (6) 

or a bit budget (equivalently a number of iterations) in coding applications. 
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Figure 2: Time Frequency grids defined by ^ a redundant monoscale time-frequency dictionary, $2^1 and <I>-j2 , two subsets of 
<&. Atoms from <&-j2 are almost the same as ^-ji but with an additional time and/or frequency offset. 

2.2. Weak MP 

Often, the size of $ prevents the use of the criterion C as defined in[3l especiaUy for infinite-size dictionaries. 
Instead, one can satisfy a weak selection rule Cweak{^, R^ I) = 4>^msak such that: 



l(i?"/,' 



> U 



sup |(i?"/>^) 



(7) 



with < t„ < 1. In practice, a weak selection step is implemented by limiting the number of atomic 
projections in which the maximum is searched. This is equivalently seen as a subsampling of the large 
dictionary $. Temlyakov |4l| proved the convergence when X) ^ "^ °° ^^-^ stability properties have been 
studied by Gribonval et al in [20[. 

Let us denote $ = {4>-y\ a large multiscale time-frequency dictionary with atom parameter 7 living in 
S* X [/ X S a finite subset of M"*" x K^. 5 denotes the set of analysis scales, U is the set of time indexes and 
S the set of frequency indexes (see for example [3J|). Computing C(4>,i?"~^/) requires, in the general case, 
the computation of all atom projections (i?"~^/, c/)^) which can be prohibitive. A weak strategy consists in 
computing only a subset of these projections, thus using a coarse subdictionary $x C $, which can be seen 
as equivalent to a subsampling of the parameter space S* x C7 x S. For instance, time indexes can be limited 
to fractions of the analysis window size. 

Reducing the size of the dictionary is interesting for coding purposes. Actually, the smaller the atom 
parameter space, the cheaper the encoding of atom indexes. Figure [2] shows an example of such dictionary 
subsampling in a time-frequency plane. Each point represents the centroid of a time- frequency atom (j)~f, 
black circles corresponds to a coarse subsampling of the parameter space and constitute the subdictionary 
$xi C $. The full dictionary $ figured by the small red crosses, is 16 times bigger than $11. 

However, the choice of subsampling has consequences on the decomposition. Figure [2] pictures an example 
of a different coarse subdictionary defined by another index subset ^x^ C $. For most signals, the decompo- 
sition will be different when using $xi or ^x^ ■ Moreover, there is no easy way to guess which subdictionary 
would provide the fastest and/or most significant decomposition. Using one of these subdictionaries instead 
of $ implies that available atoms may be less suited to locally fit the signals. MP would thus create energy 
in the time frequency plane where there is none in the signal, an artifact known as dark energy [38| . 



2.3. Higher Resolution and Locally- Adaptive Matching Pursuits 

To tackle this issue, high resolution methods have been studied. First, a modification of the atom selection 
criterion has been proposed by Jaggi et al [22| that not only takes energy into account but also local fit to 
the signal. Other artifact preventing methods have also been introduced (e.g for pre-echo [3J] or dark energy 
[39[). A second class of bypassing strategies are locally-adaptive pursuits. An atom is first selected in the 



coarse subdictionary ^x, then a local optimization is performed to find a neighboring maximum in $. Mallat 
et al proposed a Newton method |27| , Goodwin et al focused on phase tunin g |l 7| , Gribonval tuned a chirp 
parameter 18 1 and Christensen et al addressed both phase and frequency [a, |40|. An equivalent strategy is 



defined for gammachirps in [25| . Locally-adaptive methods in time- frequency dictionaries implement a form 
of time and / or frequency reassignment of an atom selected in a coarse subdictionary. 

These strategies yield representations in the large dictionary at a reduced cost. In many applications, 
the faster residuals energy decay compensates the slight computational overhead of the local optimization. 
In a previous work j28| . we emphasized the usefulness of locally-adaptive methods for shift-invariant rep- 
resentation and similarity detection in the transform domain. Though, the resulting atoms are from the 
large dictionary, i.e their indexes are more costly to encode. Moreover, these method require an additional 
parameter estimation step that may increase the overall complexity. 

2.4- Statistics and MP 

Different approaches have been proposed to enhance MP algorithms using statistics. Ferrando et al [13| 
were (to our knowledge) the first to propose to run multiple sub-optimal pursuits and retrieve meaningful 
atoms in an a posteriori averaging step. A somehow similar approach is introduced in jlO| where the authors 
emphasize a lack of precision of their representation due to the structure of the applied dictionary. To avoid 
these decomposition artifacts, they follow a Monte-Carlo like method, where the set of atom parameter is 
randomly chosen before each decomposition followed by an averaging step. 

Elad et al |12l | used the same paradigm for support recovery, taking advantage of many suboptimal 
representations instead of a single one of higher precision. Similar approaches are adopted within a Bayesian 
framework in 36|, |45|, although the subset selection is a byproduct of the choice of the prior. Their work is 



related to the branch of compressed sensing that uses pursuits and random measurement matrices to retrieve 
sparse supports of various classes of high dimensional signals [43| . A recent work by Divekar et al [7| proposed 
a form of probabilistic pursuit. Such work differs from ours in the sense that it is signal-adaptive and tuned 
for the support recovery problem. 

3. Matching Pursuits with Sequences of Sub dictionaries 

S.l. Proposed new algorithm 

This paper presents a modification of MP which consists in changing the subdictionary at each iteration. 
Let us define a sequence I = {I"}ne[o..m-i] of length m. Each X" is a set of indexes of atoms from <&. At 
iteration n, the algorithm can only select an atom in the subdictionary $X" C $. We call such algorithms 
Sequential Subdictionaries Matching Pursuits (SSMP). 

This method is illustrated by Figure [3l When the sequence I is random, we call such algorithms Random 
Sequential Subdictionaries (RSS) Pursuits. The proposed modification can be applied to any algorithm from 
the General MP family (RSS MP, RSS OMP, etc.). After m iterations, the algorithm outputs an approximant 
of the form dH) 

m — 1 

fni = ^ ^n(t)\^ (8) 

where cfy^^ g <I>x" . 

SSMP can be seen as a Weak General MP algorithm as described in [20|. Indeed, it amounts to reducing 
the atom selection choice to a subset of <&, which defines the weak selection rule ([7]). The originality of this 
work is the fact that we change the subdictionary at each iteration while it is usually fixed during the whole 
decomposition. The sequence of subdictionaries is known in advance and is thus signal independent. 

The sequence I might be built such that limm^-oo [XiZo "^i" ~ *^' although in this case, the modified 
pursuit is not equivalent to a pursuit using the large dictionary $. However, it allows the selection of atoms 
from $ that were not in the initial subdictionary $xo . 
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Figure 3: Block diagram of Sequential Subdictionaries Pursuits (SSMP). Step 1 is modified so as to have the search take place 
in a different subdictionary at each iteration. The dictionary subsampling is controlled by a fixed pre-defined sequence I. 



3.2. A first example 

Let $ be a Gabor dictionary. Each atom in $ is a Gabor atom that can be parameterized by a triplet 
(s, u, ^) G S* X f/ X S defining its scale, time and frequency localization respectively. The continuous-time 
version of such atom is given by: 

1 /t-u 



'>s,a.dt) = -^9 



Ji{t-u) 



(9) 



where g can be a Gaussian or Hann window. A discrete Gabor dictionary is obtained by sampling the 
parameter space S x U x E (and the window g). Let N be the signal dimension and K be the number of 
Gabor scales (si, S2, .., sk)- For each scale Sk, u takes integer values between and N and ^ from to Sfc/2- 
The total size of $ in this setup is M = X]fe=i ^\~ atoms. 

Let / G M be a m-sparse signal in $, meaning that / is the sum of in (randomly chosen) components 



We are interested in minimizing the reconstruction error e 



ll/-/n||- 



m 



given the 



from the full dictionary $. 
size constraint n = m/2. 

Let $10 be a subdictionary of $ defined by selecting in each scale Sk only atoms at certain time lo- 
calizations. Subsampling the time axis by half the window length is standard in audio processing (i.e. Vm 
3p G [0, \2N / Sk\ — 1], s.t. u = p.Sk/i). $xo can be seen as a subsampling of $. The size of $io in this setup 
is down to K X N atoms. MP using $xo during the whole decomposition is a special case of SSMP where all 
the subdictionaries are the same. We label this algorithm Coarse MP. 

Now let us define a pseudo-random sequence of subdictionaries $i»i that are translated versions of $io . 
Each 4>x" has the same size than ^jo but its atoms are parameterized as such: Vu 3p G [0, [2N/sk\ — 
1], s.t. u = p.Sk/2 + Tj} where r^ € [— Sfe/4, Sfe/4] is a translation parameter different for each scale Sk of each 
subdictionary $x" ■ Each $x" can also be seen as a different subsampling of 4>. The size of each subdictionary 
$X" is thus also K x N . MP using the sequence of subdictionaries $x" is labeled Random SSMP (RSS MP). 

Finally, a MP using the full dictionary $ during the whole process is labeled Full MP. We compare the 
reconstruction error achieved with these three algorithms (i.e Coarse MP, RSS MP and Full MP). 

Figure m gives corresponding results with the following setting: A^ = 10000, m = 300, S = [32, 128,512] 
and thus M — 840000, averaged over 1000 runs. Although RSS MP is constrained at each iteration to choose 
only within a limited subset of atoms, it exhibit an error decay close to the one of Full MP. 

If / was exactly sparse in $xO; then keeping the fixed subdictionary (Coarse MP) would provide a faster 
decay than using the random sequence (RSS MP). However in our setup, such case is unlikely to happen. As 
stressed by many authors cited above, the choice of a fixed subdictionary is often suboptimal with respect to 
a whole class of signals, as it may not have the basic invariants one may expect from a good representation 
(for instance, the shift invariance). A matter of concern is now to determine when RSS MP performs better 
than Coarse MP in terms of approximation error decay. 

In the next subsection we make use of order statistics to model the behavior of the two strategies (i.e. MP 
with fixed coarse subdictionary and MP with varying subdictionaries) depending on the initial distribution 
of projections. 




Figure 4: Evolution of normalized approximation error e with Matching Pursuit using a fixed subdictionary (Coarse MP), a 
pseudo random sequence of subdictionaries (RSS MP) and the full original dictionary (Full MP). Results averaged over 1000 
runs. 

3.3. Order Statistics 

We need to introduce a few tools from the order statistics theory. The interested reader can refer for 
example to [23 . l2l| for more details. Let zi,Z2,..2:„ be n i.i.d samples drawn from a continuous random 
variable Z with probability density fz and a distribution function Fz. Let us denote Zi:„, Z'2:n, ■•, -Z^mn 
the order statistics. The random variable Zi-n represents the i*^ smallest element of the n samples. The 
probability density of Zi-n is denoted f^^-^ and is given by: 



i^W = 



{n-i)\{i- 1)! 



Fz{zY-'fz{z){l-Fz{z)r 



(10) 



The density of extremum values is easily derived from this equation, in particular the maximum value of the 
sequence has density: 

fln{^)^nFz{zr-'fz{z) (11) 



The moments of the order statistics arc given by the formula: 

/oo 
z"^fL{z)dz 
-OO 

And the variance is denoted cr^^.^^. For convenience we will write the expectation ^i-^ — f^l-n- 



(12) 



3.3.1. Order statistics in an MP framework 

Given the greedy nature of MP, a statistical modeling of the series of projection maxima gives meaningful 
insights about the overall convergence properties. Let / be a signal in R^. At the first iteration, the projection 
of the residual R^ f = f over a complete dictionary $ of M unit normcd atoms {(/)i}jg[i ^/j {M > N) is given 
by: ' ' 

V^e [l..M],a, = (i?O/,0.) (13) 

Let us denote Zi ~ \ai\ and consider them as M i.i.d samples drawn from a random variable Z living in 
[0, vll/IP]- Note that such an assumption holds only if one considers that atoms in $ arc almost pairwise 
orthogonals, i.c that $ is quasi- incoherent. MP selects the atom (f>~f„ = argmaxz^ whose weights absolute 
value is given by the M*'* order statistics of Z: Zm-j 



-.M- 



ho\ = max |(i? f,<j).y/\ = z^M-.M 



Z. 



(14) 



Standard MP constructs a residual by removing this atoms contribution and iterating. Hence, weights of 
selected atoms remains independent and at iteration n the M — n-th order statistics of Z describes the selected 
weight: 

\a~fj = max |(i?"/, 0^)1 = Zm^u-.m (15) 

the energy conservation in MP allows to derive P^ . 



n-l 



i/ip = iii?"/ir + EKP (16) 



?:=o 



Combining (fT6)) and (fTS]) and taking the expectation, one gets (fTT)) 

n-l 

E(P"/f) = l|/lP-EAl.:M (17) 

We recognize the second order moment of fJ-M^i-M ~ '^M-i-M + f^\i-i-M- Similarly, from (|16p we can derive 
the variance of the estimator: 

n— 1 n— 1 

Var{\\Ry\\') = J2J2 co«(^M-.M, ^m-,:m) (18) 

i=0 j"=0 



Given a distribution model for Z, Equations (J17p and (jlSp provide mean and variance estimates of the 
residuals energy decay with a standard MP on a quasi- incoherent dictionary $. 

3.3.2. Redrawing projection coefficients: changing the dictionaries 

The idea at the core of the new algorithm described in section 13.11 is to draw a new set of projections 
at each iteration by changing the dictionary in a controlled manner. Let us now assume that we know a 
sequence of complete quasi-incoherent dictionaries {$i}ie[o.n]- Let us make the additional assumptions that 
the projection of / has the same distribution in all these dictionaries. 

At the first iteration, the process is similar, so the atom (f)^^ in $o is selected as argmax^^g^^ | (R^ f, (p^) \ 
with the same weight as (|14p . After subtracting the atom we have a new residual R^f. The trick is now to 
search for the most correlated atom in $i. The absolute values of the corresponding inner products define a 
new set of M i.i.d samples z}: 

yi€[l..M],zl^\a,\^\{R'f,<j,,)\ (19) 

Let us assume that the zj samples have the same distribution law than the Zi, except that the sample space 
is now smaller because R^f has less energy than /. This means the new random variable Zl is distributed 
like Z X " „j:i " . Let us denote ZIm-.m the M-th order statistic for Zl, its second moment is: 



11/11 



E{Zll.,,j) = E{Zl.,,,).E{^^fJ^) + cov{Zl.,,j, ^^) (20) 



I./P 



and we know that ||i?-'^/|p = ||/|p — Z^j.j^j hence 



M:M 

2 cr2 ^ _ /,,(2) ^2 ,,(4) 



which leads to: 



COviZJj.^^l, \\R^ff) = -COv{Zl;.^M, ZlnAl) = {AimY - t^M:M (21) 



(2) 1 

j(y,2 ^ _ (2) ., MmjUn , 1 // (2) ^2 (4) \ 

A^^M:M) — t^M:M-\^ II f\\2 I + II f\\2 \\t^M:M) ~ ^^1/:!/^ 



(4) 
,,(2) _ t^M:M 
t^M-.M ||j-||2 



Since the new residual R^f has energy |j_R'^/jp = ||i?,^/jp — Zl|^.^j, taking the expectation: 

(4) 

mR'm^ur-^AI-.M + jfp- m) 

and we see higher moments of the M-th order statistic appear in the residuals energy decay formula. 
At the n-th iteration of this modified pursuit, we end up with a rather simple formula (j23|) where higher 
moments of the M-th order statistic of Z appear. 

n . ^ (2i) 

E(||i?"7f ) = II/IP + Y.(-^y [,) Jjfw^ (23) 

Similarly, a variance estimator can be derived that only depends on the moments of the highest order statistic: 

Varm-m = E E(-l)^"^ (I) ( ■) " ||;|^;:^^i)" (24) 

3.4- Comparison with simple m,odels 

Let us now compare the behavior of the new redrawn samples strategy to the one where they arc fixed. 

The relative error t{n) ~ 10 log " y ,.| | j' after n iterations is computed for both strategies. Knowing 
the probability density function (pdf) /z, Equations (fT7| to (f24|) provide closed-form mean and variance 
estimate of e(n) for both strategies. For example if Z ~ Z-/(0, 1), its order statistics follow a beta distribution: 

Zk:M "^ Beta{k, Af -I- 1 — k) and all moments IJ-M-k-M '^^^ ^^ easily found. 

A graphical illustration of potential of the new strategy is given by Figure O For three different distri- 
butions of Z , namely uniform, normal (actually half normal since no negative values are considered) and 
exponential. The density function of two order statistics of interest (i.e the maximum ffj.j^j, and the il//2-th 
element ff;/2-M) ^^ plotted. These two elements combined provide an insight on how fast the expected value 
of selected weight will decline using the fixed strategy. 

Some comments can be made: 

• In the uniform case: one might still expect to select relatively large coefficients (i.e good atoms) after 
M/2 iterations with the standard strategy. In this case, redrawing the samples (i.e changing the 
subdictionary) appears to be detrimental to the error decay rate. 

• For normally and exponentially distributed samples, the expected value of a coefficient selected in a 
fixed sequence drops more quickly towards small values (i.e there are relatively fewer large values). 
Relative error profiles indicate that selecting the maximum of redrawn samples can prove beneficial in 
terms of minimizing the reconstruction error. It is promising for sparse approximation problems such 
as lossy compression. 

In practice, the uniform model is not well suited for sparse approximation problems. It basically indicates 
that the chosen dictionary is poorly correlated with the signal (e.g. white noise in Dirac dictionaries). 
Exponential and normal distributions, on the opposite, are closer to practical situations where the signal is 
assumed sparse in the dictionary, with additional measurement noise such as the one presented in section 
13.21 In these cases, changing the subdictionaries seems to be beneficial. 

Let us again stress that we have only presented a model for error decays with two strategies and a set of 
strong assumptions. It is not a formal proof of a guaranteed faster convergence. However, it gives insight on 
when the proposed new algorithm may be useful and when it may not, along with estimators of the decay 
rates when a statistical modeling of signal projections on subdictionaries is available. To our knowledge, such 
modeling was not proposed before for MP decompositions. 
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Figure 5: Left : Pdf of Z and pdfs for the maximum and the A//2-th order statistic of Z for uniform, normal (a = 1) and 
exponential {fj, = 1) distributions. Right: relative error decay (mean and variances) as a function of iteration number. M = 100 



10 



Original Signal 



Matching Pursuit 



Yn 












K}. 


Quantization 

Q 


{Q{0^^)} 







Entropy 
Coder 



BitStream 



n-term approximant 



fn = Er=0 Q{ai)4>-ji 

quantized n-term 
approxinnant 



Figure 6: Block diagram of encoding scheme: each atom selected by the pursuit algorithm is transmitted after quantizing its 
associated weight. A simple entropy coder then yields the bitstream 

4. Matching Pursuit with Random Time Frequency Subdictionaries for audio compression 

4.I. Audio compression with union of MDCT dictionaries 

Sparse decompositions using Matching Pursuit have been proven by RaveUi et 



34| to be competitive 



for low bit-rate audio compression. In spite of Gabor dictionaries, they use the Modified Discrete Cosine 
Transform (MDCT) [H that is at the core of most transform coders (e.g. MPEGl-Layer III [ij, MPEG 
2/4 AAC). Let us recall the main advantage of RSSMP compared to Coarse MP: reconstruction error can 
be made smaller with the same number of iterations. If the random sequence of subdictionaries is already 
known by both coder and decoder then the cost of encoding each atom is the same in both cases. 

For this proof of concept, a simplistic coding scheme is used as presented in Figure |6l For each atom 
in the approximant, the index is transmitted alongside its quantized amplitude (using a uniform mid-tread 
quantizer) . The cost of encoding an atom index is log2 ( M ) bits where M is the size of the dictionary in which 
the atom was chosen. A quantized approximation /„ of / is obtained after n iterations with a characteristic 
SNR (same as SRR but computed on /„ instead of /„) and associated bit-rate. 

The signals arc taken from the MPEG Sound Quality Asscsment Material (SQAM) datasct, their sampling 
frequency is reduced to 32000 Hz. Three cases arc compared in this study: 

Coarse MP Matching Pursuit with a union of MDCT basis with no additional parameter. The MDCT 
basis have sizes from 4 to 512 ms (8 dyadic scales from 2^ to 2-^^ samples) hop size is 50% in each scale. 

LoMP Locally Qptimized Matching Pursuit (LoMP) with a union of MDCT Basis. This is a locally- 
adaptive pursuit where an atom is first selected in the coarse dictionary. Then its time localization is 
optimized in a neighbourhood to find the best local atom in the full dictionary. This pursuit is a slightly 
suboptimal equivalent of a Matching Pursuit using the full dictionary (Full MP). It is nevertheless less 
computationally intensive. However, it requires the computation (and the transmission) of an additional 
parameter (the local time shift) per atom (see [28[ for more details). 

RSS MP MP with a random sequence of subdictionaries. Each subdictionary is itself a union of MDCT 
basis, each basis of scale Sk is randomly translated in time by a parameter r^ G [— Sfc/4 : Sfc/4]. It is 
worth noticing that each scale is independently shifted. The sequence T = {Tl}k=i..K,i=i..n is known 
in advance at both encoder and decoder side, i.e. there is no need to transmit it. 

Step SSMP and Jump SSMP arc two deterministic SSMP algorithms that are further described in Section [5l 

4-2. Sparsity results 

Figure [7] shows the number of iterations needed with the three described algorithm (and two variants 
discussed in 15. 1|) to decompose short audio signals from the MPEG SQAM datasct ^ (a glockenspiel, an 
orchestra, a male voice , a solo trumpet and a singing voice), at 10 dB of SRR. One can verify that RSS MP 
yields sparser representations (fewer atoms atoms are needed to reach a given SRR) than Coarse MP. This 
statement remains valid at any given SRR level in this setup. 

Experiments where run 100 times with the audio signal being randomly translated in time at each run. 
Figure [7] shows that empirical variance remains low, which gives us confidence in the fact that RSS MP will 
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Figure 7: Number of iterations needed to achieve 10 dB of SRR. Means and standard deviation for various 4 seconds length 
audio signals from the MPEG SQAM dataset with MP on various fixed and varying sequences of subdictionaries and random 
initial time offset. Step SSMP and Jump SSMP are two deterministic SSMP. 

give sparser representations in most cases. So far we have not found a natural audio example contradicting 
this. 

4-3. Compression results 

Figure [7] shows that the locally-adaptive algorithm (LoMP) is better in terms of sparsity of the achieved 
representation than RSS MP. However, each atom in these representations is more expensive to encode. 
Actually, one must transmit an additional local time shift parameter. With RSS MP no such parameter 
needs to be transmitted. This gives RSS MP a decisive advantage over the two other algorithms for audio 
compression. 

As an example, let us examine the case of the last audio example labeled vega. This is the first seconds 
from Susan Vega's song Tom's Dinner. The three algorithms are run with a union of 3 MDCT scales (atoms 
have lengths of 4, 32 or 256 ms). Again these scores have been averaged over 100 runs with the signal being 
randomly offset at each run and showed very little empirical variance. In order to reach e.g. 20 dB of SRR 
(i.e before weight quantization): 

• Coarse MP selected 6886 atoms. If each has a fixed index coding cost of 18 bits and weight coding cost 
of 16 bits, a total cost of 231 kBits would be needed. 

• LoMP selected 3691 atoms. With same index and weight cost plus an additional time shift parameter 
whose cost depends on the length of the selected atom, the total cost would be 149 kBits. 

• RSS MP selected 3759 atoms (with empirical standard deviation of 16 atoms). Each atom has the same 
fixed cost as in the Coarse MP case. The average total size is thus 126 kBits. 

Using an entropy coder as in Figure |6] does not fundamentally change these results. Figure |8] summarizes 
compression results with Coarse MP, LoMP and RSS MP for various audio signals from this dataset. Although 
these comparisons do not use the most efficient quantization and coding tools, it is clear from this picture that 
the proposed randomization paradigm can be efficiently used in audio compression with greedy algorithms. 
In this proof of concept, the quality measure adopted was the SNR. However, such algorithms and 
dictionaries arc known to introduce disturbing ringing artifacts and pre-ccho. We have not experienced that 
the proposed algorithm increased nor reduced these artifacts compared to the other pursuits. Furthermore, 
pre-echo control and dark energy management techniques can be applied with this algorithm just as with 
any other. Audio examples are available onlinqj. 
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Figure 8: SNR/Bitrate curves for ds-length signals from the MPEG SQAM dataset with multi-scale MDCT dictionaries ClJ ,2^^ 
and 2^^ samples per window). 
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5. Discussion 

Having presented the novel algorithm and an application to audio coding, further questions are inves- 
tigated in this section. First an experimental study is conducted to illustrate why the sequence of subdic- 
tionaries should be pseudo-randomly generated. Then, the RSS scheme is applied to orthogonal pursuit and 
finally a complexity study is provided. 

5.1. Random vs Deterministic 

In the above experiments we have constructed a sequence of randomly varying subdictionarics by designing 
a sequence of time shifts T = {Tl}k=i..K.i=i..n by means of a pseudo-random generator with a uniform 
distribution: V(fc, i)r^. ^ Z-/(— Sfe/4, Sfc/4). There are other ways to construct such a sequence, some of which 
are deterministic. However, we have found that this pseudo-random setup is interestingly the one that gives 
the best performances. 

To illustrate this, let us recall the setup of Section|4]and compare the sparsity of representations achieved 
with sequences of subdictionarics built in the following manner: 

RANDOM: T is randomly chosen: V(A:,i)T^ -^^(-5^/4,5^/4) 

STEP: T is constructed with stepwise increasing sequences : V(fc, i)r^. = mod{Tl,~^ + 1, Sfe/2) and V/c, r^ = 0. 
This sequence yields dictionaries that are quite close from one iteration to the next. 

JUMP: T is constructed with jumps : V(fc, i)T^ = mod{Tlr^ + Sk/4: - l,Sfc/2)and Vfc,T^ == 0. This se- 
quence yields dictionaries that are very dissimilar from one iteration to the next, but are quite close to 
dictionaries 2 iterations later. 

Figure [7] shows that among all cases, the random strategy is the best one in terms of sparsity. The worst 
strategy is STEP. This can be explained by the fact that, from one iteration to the next, the subdictionarics 
are well correlated. Indeed all the analysis windows arc shifted simultaneously by the same (little) offset. 
The JUMP strategy is already better. Here the different analysis scales are shifted in different ways. Low 
correlation between successive subdictionarics appears to be an important factor. Experiments were run 100 
times and the hierarchical pattern remain unchanged for higher SRR levels. 

Many more deterministic strategies have been tried, the random strategy remains the most interesting 
one when no further assumption about the signal is made. We explain this phenomenon by noticing that 
pseudo random sequences are more likely to yield subdictionarics that are very uncorrelated one to another 
not only on a iteration-by-iteration basis but also for a large number of iterations. 

5.2. Orthogonal pursuits 

Other SSMP family members can be created with a simple modification of their atom selection rule. In 
particular, one may want to apply this technique to Orthogonal Matching Pursuit. The resulting algorithm 
would then be called RSS OMP 

To evaluate RSS OMP, we have recreated the toy experiment of [3[ and use 1000 random dictionaries of 
size 128 X 256 with atoms drawn uniformly from the unit sphere. Then, 1000 random signals are created using 
64 elements from the dictionaries with unit variance zero mean Gaussian amplitudes. For each signal and 
corresponding dictionary, the decomposition is performed using MP and OMP with a fixed subdictionary of 
size 128 X 64 (Coarse MP and Coarse OMP), a random sequence of subdictionarics of size 128 x 64 (RSS MP 
and RSS OMP), and the complete dictionary (FuU MP and Full OMP). Matlab/Octave code to reproduce 
this experiment is available online. 

Averaged results are shown Figure El Fixed subdictionary cases (Coarse MP/OMP) show a saturated 
pattern caused by incompleteness of the subdictionary. Although working on subdictionarics four times 
smaller, pursuits on random sequential subdictionarics exhibit a behavior close to pursuits on the complete 



1 |http://www.tsi.telecom-paristech.fr/aao/?p=531| 
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Figure 9: Comparison between algorithm MP (thin) and OMP (bold) with fixed coarse subdictionary i^o (dashed dotted blue), 
random sequential subdictionaries <l?x" and full dictionary <J> (dashed red). Input signal is a collection of N/2 vectors from # 
with normal weights. Results averaged over 1000 runs. Here N = 128, M = 256 
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Table 1; Complexities of the different steps for various algorithm in the general case (no update trick for projection). N is the 
signal dimension. M is the number of atoms in <3> and L the number of atoms in subdictionaries. The iteration number i is 
always smaller than N. 

dictionary. Potential gains are even more visible with the OMP algorithm, where the probabilistic parsing 
of the large dictionary allows for a much better error decay rate than the fixed subdictionary case (Coarse 
OMP). 

5.3. Computational Complexities 

5.3.1. Iteration complexities in the general case 

Let / be in M.^ and $ be a redundant dictionary of Af atoms also in R^. Let <i>x be a subdictionary 
of $ of L atoms. Complexities in the general case are given at iteration i by Table [T] When no update 
trick is used, Step 1 requires the projection of the A^ dimensional residual onto the dictionary followed by 
the selection of the maximum index. Step 2 is rather simple for standard MP, it only involves an update 
of the residual through a subtraction of the selected atom. For OMP though, a Gram Matrix computation 
followed by an optimal weights calculation is needed to ensure orthogonality between the residual and the 
subspace spanned by all previously selected atoms. Complexities of this two additional steps increases with 
the iteration number and has led to the development of bypassing techniques that limit complexity using 
iterative QR factorization trick [3|. Complexities in Table [T] are given according to these tricks. 

5.3.2. Update tricks and short atoms 

Although RSS MP has the same complexity in the general case as Coarse MP, there are practical lim- 
itations to its use. First, Mallat et al [27| proposed a simple update trick that effectively accelerates a 
decomposition. Projection step is performed using previous projection values and the pre-computed inner 
products between atoms of the dictionary. Changing the subdictionary on an iteration-by-iteration basis 
prevents from using this trick. 
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Table 2: Complexities after iteration 1 of the different steps for various algorithm in structured time-frequency dictionaries with 
fast transform and limited atomic support P << A'^. In this case, the local update speed up trick can not be applied to RSS 
MP. 

An even greater optimization is available when using fast transforms and more specifically when only 
local updates are possible. One of the main accelerating factor proposed for MP in the MPTK framework 
|24| and for OMP in [26[ is to limit the number of projections that require an update to match the support 
of the selected atom. Again, when changing dictionaries at each iteration, this trick can no longer be used 
to accelerate Step 1. 

Using fast transforms, the complexity of the projection step at first iteration reduces to 0(A/logiV) for 
Full MP [23. Additionally if one uses short atoms of size P << iV, it further reduces to 0[MlogP). When 
local update can be used instead of full correlation, the projection step after iteration 1 has an even more 
reduced complexity of 0{aP log P) in $1 where a = -^ is the redundancy factor and of 13 P log P in $ where 
P = ^- = a^ . Regardless of the dictionary, the residual only needs a local update around the chosen atom. 

Table[2]shows that in this context, the proposed modification becomes less competitive with respect to the 
standard algorithm on fixed subdictionary and can even be slower than the full dictionary case if L > ^-^■ 
However, it can still be considered as a viable alternative since memory requirements for full dictionary 
projections can get prohibitive. 

In order to limit the computational burden, it is possible to change the subdictionary only once every 
several iterations. If the subdictionary $xi remains unchanged for J consecutive iterations, then the usual 
tricks for speeding up the projection can be applied. When $x> changes at iteration i + J, all projections 
need be recomputed. Although it can accelerate the algorithm, one may expect a resulting loss of quality. 

5.3.3. Sparsity/ Complexity tradeoff 

Another way to tackle the complexity issue is by choosing an appropriate size for the subdictionaries. 
Since RSS MP yields much sparser solutions with dictionaries of the same size than Coarse MP, we can 
further reduce the size of the subdictionaries in order to accelerate the computation while still hoping to get 
compact representations. Indeed, in the discussion above, we have only compared complexities per iteration, 
but the total number of iterations is also important. 

To illustrate this idea, we have used subsampled subdictionaries with a subsampling factor varying from 1 
(no subsampling) to 32. The subsampling is performed by limiting the frame indexes in which we calculate the 
MDCT projections. It is important to notice that the subdictionaries are no longer redundant, and not even 
complete when the subsampling factor is greater than 2. Figure [10] shows for two signals how the subsampling 
impacts the sparsity of the representation (here expressed as a bit-rate calculated as in section |4]) and the 
computational complexity relative to the reference Coarse MP algorithm for which we have implemented the 
local update trick. 

Dictionaries used in this setup are multiscale MDCT with 8 different scales (from 4 to 512 ms), the 
experiment is run 100 times on a dual core computer up to a SRR of 10 dB. Using random subdictionaries 
10 times smaller than the fixed one of Coarse MP, we can still have good compression ( cost 30 % less than 
the reference) with the computation being slightly faster. 

More interestingly, this subsampling factor, associated with the RSS MP algorithm, controls a spar- 
sity /complexity tradeoff that allows multiple use cases depending on needs and resources. 
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Figure 10; Illustration of the sparsity/ complexity tradeoff that can be achieved by controlling the size of the subdictionaries in 
RSS MP. 

6. Conclusion 

In this work, we proposed a inodificatioii of greedy algorithms of the MP family. Using a pseudo random 
sequence of subdictionaries instead of a fixed subdictionary can yield sparser approximations of signals. This 
sparsity basically comes at no additional coding cost if the random sequence is known in advance, thus giving 
our algorithm a clear advantage for compression purposes as shown here for audio signals. On the downside, 
the modification may increase complexity, but we have proposed a tradeoff strategy that helps reducing 
computation times. 

The idea presented in this work can be linked to existing techniques from other domains, where randomness 
is used to enhance signal processing tasks. Dithering for quantization is one such technique. Another example 
is spread spectrum in communication [8|, where a signal is multiplied by a random sequence before being 
transmitted on a bandwidth that is much larger than the one of the original content. Here, encryption 
is more important a goal than compression. However, performing RSS MP of a signal somehow requires 
the definition of a key (the pseudo-random sequence) that must be known at the reception side in order 
to decode the representation. Moreover, the representation may also live in a much larger space than the 
original content. It is worth noticing for instance the work of Puy et al [33| where the authors apply spread 
spectrum techniques to the compressed sensing problem. 

Other experiments should be conducted for feature extraction tasks, since the good convergence properties 
indicates that the selected components carry meaningful information. For instance, music indexing |35| and 
face pattern recognition |3l| use Matching Pursuits to derive the features that serve for their processing. 
Further work will have to investigate whether RSS MP can improve on the performance of other algorithms 
on such tasks. 
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